We show that the thermal subadditivity of entropy provides a common basis to derive a strong
form of the bounded difference inequality and related results as well as more recent inequalities
applicable to convex Lipschitz functions, random symmetric matrices, shortest travelling salesmen
paths and weakly self-bounding functions. We also give two new concentration inequalities.

1. Introduction

Concentration inequalities bound the probabilities that random quantities deviate from 
their average, median or otherwise typical values. They are at the heart of empirical 
science and play an important role in the study of natural and artificial learning systems. 

An early concentration inequality for sums was given by Chebychev and Bienaymé
in the 19th century [6] and allowed a rigorous proof of the weak law of large numbers.
The subject has since been developed by Bernstein, Chernoff, Bennett, Hoeffding and
many others [1, 9], and results were extended from sums to more general and complicated 
nonlinear functions. During the past few decades, research activity has been stimulated 
by the contributions of Michel Talagrand [22, 23] and by the relevance of concentra- 
tion phenomena to the rapidly growing field of computer science. Some concentration 
inequalities, like the well-known bounded difference inequality, have become standard 
tools in the analysis of algorithms [19]. Nevertheless, a unified and elementary basis for 
the derivation of the many available results is still missing. 

One of the more recent methods used to derive concentration inequalities, the so-called 
entropy method, is rooted in the early investigations of Boltzmann [2] and Gibbs [7] into 
the foundations of statistical mechanics. A general problem of statistical mechanics is to 
demonstrate the "equivalence of ensembles", which can be interpreted as an exponential
concentration property of the Hamiltonian, or energy function. While the modern entropy
method evolved along a complicated historical path via quantum field theory and the
logarithmic Sobolev inequality of Leonard Gross [8], its hidden simplicity was understood
and emphasized by Michel Ledoux, who also recognized the key role that the subadditivity
of entropy can play in the derivation of concentration inequalities [10, 11]. Recently, 
Boucheron et al. [4] showed that the entropy method is sufficiently strong to derive 
a form of Talagrand's convex distance inequality.

The purpose of this paper is to advertise the subadditivity of entropy as a unified basis
for the derivation of concentration inequalities for functions on product spaces and to 
demonstrate the benefits of formulating the concentration problem in the language of 
statistical thermodynamics, an approach proposed by David McAllester [18]. 

Our method consists of three steps. The first step (Theorem 1) expresses the log- 
Laplace transform (or, more directly, the deviation probability) in terms of an integral 
of the thermal entropy over a range of inverse temperatures. This step encapsulates the 
so-called Herbst argument. 

The second step (Theorem 6) is the tensorization inequality, or, more properly, a ther- 
mal subadditivity property of entropy. It asserts that the entropy of a system is no greater 
than the thermal average of the sum of entropies of the constituent subsystems. 

The third step (Theorem 3) expresses the entropy of the subsystem in terms of thermal 
energy fluctuations. 

All three steps are elementary and their combination leads to a general concentra- 
tion result (Theorem 7) that can be used whenever we succeed in controlling the latter 
fluctuations. 

We then use the method to first derive a strong form of the bounded difference in- 
equality and an inequality given by McDiarmid and related to Bennett's inequality [19]. 
These results are normally not associated with the entropy method. Then monotonicity 
properties of thermal energy fluctuations, or bounds thereof, are exploited to derive two 
apparently novel sub-Gaussian tail bounds and to give a new proof of an upper tail
bound in [16] that improves on some results obtained from Talagrand's convex distance
inequality. Finally, we show how our method can be extended in a generic way using
self-boundedness and/or decoupling, and illustrate this extension by deriving a concentration
inequality that underlies the recent new proof of the convex distance inequality [4].

Clearly, the statement and proof of all the results presented in this paper would be possible
on a purely formal basis without any reference to physics. The author believes, however, 
that positioning the subject in a broader scientific context highlights its historical con- 
nections and gives access to a valuable source of intuition. 

In the next section, we describe the connection between entropy and concentration and 
introduce several thermodynamic functions. We then transfer these concepts to product 
spaces and present the tensorization inequality. The remaining sections are dedicated to 
applications, and we conclude with a tabular summary of the notation used in the paper. 

2. Entropy and concentration 

Let $(\Omega, \Sigma, \mu)$ be a probability space and $f \in L_\infty[\mu]$ be a fixed function whose
concentration properties are to be studied.

We interpret the points $x \in \Omega$ as possible states of a physical system and $f$ as the
negative energy (or Hamiltonian) function, so that $-f(x)$ is the system's energy in the
state $x$. The measure $\mu$ models an a priori probability distribution of states in the absence
of any constraining information.

We will ignore questions of measurability. If it seems necessary to the reader, $\Omega$ may
be taken as a potentially very large finite set, the cardinality of which will play no role
in our results. The boundedness assumption is a simplification that is justified by the
fact that most of our results are vacuous for $f \notin L_\infty[\mu]$. In the remaining cases, we will
mention optimal conditions on $f$.

For any $g \in L_1[\mu]$ we write $E[g] = \int_\Omega g\, d\mu$, and $\sigma^2[g] = E[(g - E[g])^2]$.

2.1. Thermal equilibrium and thermodynamic functions 

Our function $f$ defines a one-parameter family $\{E_{\beta f} : \beta \in \mathbb{R}\}$ of expectation functionals
by

$$E_{\beta f}[g] = \frac{E[g e^{\beta f}]}{E[e^{\beta f}]}, \qquad g \in L_\infty[\mu].$$

In statistical thermodynamics, $E_{\beta f}[g]$ is the thermal expectation of the observable $g$ at
temperature $T = 1/\beta$. The normalizing expectation is called the partition function,

$$Z_{\beta f} = E[e^{\beta f}].$$

The corresponding probability measure on $\Omega$,

$$d\mu_{\beta f} = Z_{\beta f}^{-1} e^{\beta f}\, d\mu,$$

is called the canonical ensemble. It describes a system in thermal equilibrium with a heat
reservoir at temperature $T = 1/\beta$. The canonical ensemble has the density $\rho = Z_{\beta f}^{-1} e^{\beta f}$,
which maximizes the Kullback-Leibler divergence or relative entropy $KL(\rho\, d\mu, d\mu) :=
E[\rho \ln \rho]$, given the expected internal energy $-E[\rho f]$. The parameter $\beta$ is the Lagrange
multiplier corresponding to this constraint. For a constant $c$ we have the obvious and
important identity $E_{\beta(f+c)}[g] = E_{\beta f}[g]$.

The corresponding maximal value of the Kullback-Leibler divergence is the canonical 
entropy 

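$$S_f(\beta) = KL(Z_{\beta f}^{-1} e^{\beta f}\, d\mu, d\mu) = \beta E_{\beta f}[f] - \ln Z_{\beta f}. \tag{2.1}$$

Note that $S_{-f}(\beta) = S_f(-\beta)$, a simple but very useful fact to pass from upper to lower
tails.

For $\beta \neq 0$ the Helmholtz free energy is defined by

$$A_f(\beta) = \frac{1}{\beta} \ln Z_{\beta f}.$$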

Dividing (2.1) by $\beta$ and writing $U = E_{\beta f}[f]$, we obtain the classical thermodynamic
relation

$$A = U - TS,$$

which describes the macroscopically available energy $A$ as the difference between the total
expected energy $U$ and an energy portion $TS$, which is inaccessible due to ignorance of
the microscopic state.

By L'Hôpital's rule, we have $\lim_{\beta \to 0} A_f(\beta) = E[f]$, so the free energy $A_f$ extends
continuously to $\mathbb{R}$ by setting $A_f(0) = E[f]$. We find

$$A_f'(\beta) = \frac{1}{\beta} E_{\beta f}[f] - \frac{1}{\beta^2} \ln Z_{\beta f} = \beta^{-2} S_f(\beta).$$

Integrating this identity from zero to $\beta$ and multiplying with $\beta$, we obtain:

Theorem 1. For any $\beta > 0$ we have

$$\ln E[e^{\beta(f - E[f])}] = \beta \int_0^\beta \frac{S_f(\gamma)}{\gamma^2}\, d\gamma$$

and, for $t \ge 0$,

$$\Pr\{f - Ef > t\} \le \exp\left(\beta \int_0^\beta \frac{S_f(\gamma)}{\gamma^2}\, d\gamma - \beta t\right).$$

Proof.

$$\ln E[e^{\beta(f - E[f])}] = \ln Z_{\beta f} - \beta E[f] = \beta (A_f(\beta) - A_f(0)) = \beta \int_0^\beta A_f'(\gamma)\, d\gamma = \beta \int_0^\beta \frac{S_f(\gamma)}{\gamma^2}\, d\gamma.$$

Combining this with Markov's inequality gives the second assertion. □ 

The theorem shows how bounds on the canonical entropy can lead to concentration 
results. In the following we present ways to arrive at such bounds. 
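
As a sanity check of the identity in Theorem 1, the following minimal sketch evaluates both sides of $\ln E[e^{\beta(f - E[f])}] = \beta \int_0^\beta S_f(\gamma)\gamma^{-2}\, d\gamma$ on a small finite space. The four-point prior `mu`, the values `f` and the midpoint quadrature are illustrative assumptions and not part of the paper.

```python
import math

# Hypothetical finite model: four states with prior weights mu and values f.
mu = [0.1, 0.2, 0.3, 0.4]
f = [0.0, 1.0, 2.5, -1.0]

def expect(g):
    """Plain expectation E[g] under the prior mu."""
    return sum(m * v for m, v in zip(mu, g))

def log_partition(beta):
    """ln Z_{beta f} = ln E[exp(beta f)]."""
    return math.log(expect([math.exp(beta * v) for v in f]))

def thermal_expect(beta, g):
    """Thermal expectation E_{beta f}[g] = E[g e^{beta f}] / E[e^{beta f}]."""
    w = [math.exp(beta * v) for v in f]
    return expect([gv * wv for gv, wv in zip(g, w)]) / expect(w)

def entropy(beta):
    """Canonical entropy S_f(beta) = beta E_{beta f}[f] - ln Z_{beta f}."""
    return beta * thermal_expect(beta, f) - log_partition(beta)

def herbst_integral(beta, steps=2000):
    """beta * int_0^beta S_f(g) / g^2 dg, midpoint rule (never evaluates g = 0)."""
    h = beta / steps
    total = sum(entropy((i + 0.5) * h) / ((i + 0.5) * h) ** 2 for i in range(steps))
    return beta * total * h

beta = 1.5
lhs = log_partition(beta) - beta * expect(f)  # ln E[exp(beta (f - E f))]
rhs = herbst_integral(beta)
print(lhs, rhs)  # the two values should agree up to quadrature error
```

The midpoint rule is chosen only because the integrand $S_f(\gamma)/\gamma^2$ is a $0/0$ expression at $\gamma = 0$; any quadrature that avoids the endpoint would serve.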

2.2. Entropy and energy fluctuations 

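The thermal variance of a function $g \in L_\infty[\mu]$ is denoted $\sigma^2_{\beta f}(g)$ and defined by

$$\sigma^2_{\beta f}(g) = E_{\beta f}[(g - E_{\beta f}[g])^2] = E_{\beta f}[g^2] - (E_{\beta f}[g])^2.$$

For constant $c$ we have $\sigma^2_{\beta(f+c)}[g] = \sigma^2_{\beta f}[g]$.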

We first give some simple results pertaining to the derivatives of the partition function 
and the thermal expectations. 

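Lemma 2. The following formulas hold:

1. $\frac{d}{d\beta}(\ln Z_{\beta f}) = E_{\beta f}[f]$.

2. If $h : \beta \mapsto h(\beta) \in L_\infty[\mu]$ is differentiable and $(d/d\beta) h(\beta) \in L_\infty[\mu]$, then
$\frac{d}{d\beta} E_{\beta f}[h(\beta)] = E_{\beta f}[h(\beta) f] - E_{\beta f}[h(\beta)] E_{\beta f}[f] + E_{\beta f}\left[\frac{d}{d\beta} h(\beta)\right]$.

3. $\frac{d}{d\beta} E_{\beta f}[f^k] = E_{\beta f}[f^{k+1}] - E_{\beta f}[f^k] E_{\beta f}[f]$.

4. $\frac{d^2}{d\beta^2}(\ln Z_{\beta f}) = \frac{d}{d\beta} E_{\beta f}[f] = \sigma^2_{\beta f}[f]$.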

Proof. 1 is immediate and 2 is a straightforward computation. 3 and 4 are immediate 
consequences of 1 and 2. □ 

The thermal variance of / itself corresponds to energy fluctuations. The next theorem 
represents entropy as a double integral of such fluctuations. The utility of this represen- 
tation to derive concentration results has been noted by David McAllester [18]. 

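Theorem 3. We have for $\beta > 0$

$$S_f(\beta) = \int_0^\beta \int_t^\beta \sigma^2_{sf}[f]\, ds\, dt.$$

Proof. Using the previous lemma and the fundamental theorem of calculus, we obtain
the formulas

$$\beta E_{\beta f}[f] = \int_0^\beta E_{\beta f}[f]\, dt = \int_0^\beta \left(\int_0^\beta \sigma^2_{sf}[f]\, ds + E[f]\right) dt$$

and

$$\ln Z_{\beta f} = \int_0^\beta E_{tf}[f]\, dt = \int_0^\beta \left(\int_0^t \sigma^2_{sf}[f]\, ds + E[f]\right) dt,$$

which we subtract to obtain

$$S_f(\beta) = \beta E_{\beta f}[f] - \ln Z_{\beta f} = \int_0^\beta \left(\int_0^\beta \sigma^2_{sf}[f]\, ds - \int_0^t \sigma^2_{sf}[f]\, ds\right) dt = \int_0^\beta \left(\int_t^\beta \sigma^2_{sf}[f]\, ds\right) dt.$$

□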
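
The fluctuation representation of Theorem 3 can be checked numerically. The sketch below is only an illustration under stated assumptions: the three-point space is hypothetical, and the double integral is first reduced by exchanging the order of integration, $\int_0^\beta \int_t^\beta \sigma^2_{sf}[f]\, ds\, dt = \int_0^\beta s\, \sigma^2_{sf}[f]\, ds$, a step that is ours and not taken from the text.

```python
import math

# Hypothetical three-point space with prior mu and negative energy f.
mu = [0.25, 0.25, 0.5]
f = [-1.0, 0.5, 2.0]

def canonical(beta):
    """Return the canonical probabilities and the partition function Z_{beta f}."""
    w = [m * math.exp(beta * v) for m, v in zip(mu, f)]
    z = sum(w)
    return [x / z for x in w], z

def entropy(beta):
    """S_f(beta) = beta E_{beta f}[f] - ln Z_{beta f}."""
    p, z = canonical(beta)
    return beta * sum(pi * v for pi, v in zip(p, f)) - math.log(z)

def thermal_variance(s):
    """sigma^2_{sf}[f], the variance of f under the canonical ensemble at s."""
    p, _ = canonical(s)
    mean = sum(pi * v for pi, v in zip(p, f))
    return sum(pi * (v - mean) ** 2 for pi, v in zip(p, f))

def fluctuation_integral(beta, steps=2000):
    """int_0^beta s * sigma^2_{sf}[f] ds by the midpoint rule."""
    h = beta / steps
    return h * sum((i + 0.5) * h * thermal_variance((i + 0.5) * h)
                   for i in range(steps))

beta = 2.0
s_direct = entropy(beta)
s_fluct = fluctuation_integral(beta)
print(s_direct, s_fluct)  # should agree up to quadrature error
```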



Since bounding $\sigma^2_{\beta f}[f]$ is central to our method, it is worth mentioning an interpretation
in terms of heat capacity, or specific heat. Recall that $-E_{\beta f}[f]$ is the expected internal
energy. The rate of change of this quantity with temperature $T$ is the heat capacity. By
conclusion 4 of Lemma 2 we have

$$\frac{d}{dT}(-E_{\beta f}[f]) = \frac{1}{T^2} \sigma^2_{\beta f}[f],$$

which exhibits the proportionality of heat capacity and energy fluctuations.



2.3. A variational entropy bound 



While Theorem 3 is just an elementary way of rewriting the canonical entropy, the
following lemma is typically a strict inequality that leads to the modified logarithmic
Sobolev inequality proposed by Massart in [13]. To state it, we define the real function

$$\psi(t) = e^t - t - 1, \tag{2.2}$$

which arises from deleting the first two terms in the power series expansion of the
exponential function.

Lemma 4. If $c \in \mathbb{R}$, then

$$S_f(\beta) \le E_{\beta f}[\psi(-\beta(f - c))].$$

Proof. Using $\ln t \le t - 1$, we get

$$\beta f - \ln Z_{\beta f} = \beta(f - c) + \ln \frac{e^{\beta c}}{Z_{\beta f}} \le \beta(f - c) + \frac{e^{\beta c}}{Z_{\beta f}} - 1.$$

Taking the thermal expectation then gives

$$S_f(\beta) \le E_{\beta f}[\beta(f - c)] + \frac{e^{\beta c}}{Z_{\beta f}} - 1 = E_{\beta f}[\beta(f - c)] + E\left[\frac{e^{-\beta(f - c)} e^{\beta f}}{Z_{\beta f}}\right] - 1 = E_{\beta f}[e^{-\beta(f - c)} + \beta(f - c) - 1].$$

□



3. Product spaces 

We now assume that $\Omega = \prod_{k=1}^n \Omega_k$ and $\mu = \bigotimes_{k=1}^n \mu_k$, where each $\mu_k$ is the probability
measure representing the distribution of some variable $X_k$ in the space $\Omega_k$, where all
the $X_k$ are assumed to be mutually independent. The $X_k$ are irrelevant for the derivation
of our inequalities, but they are convenient in the discussion of applications.

If $x = (x_1, \ldots, x_n) \in \Omega$ describes a state of a physical system, we can think of $x_k \in \Omega_k$
as the state of the $k$th subsystem, which may be a particle or a more abstract object, such
as a spin assigned to the vertex of a graph. The a priori measure $\mu$ assigns independent
probabilities $\mu_k$ to the states of the subsystems. If the total energy is a sum of energies
of the subsystems, $f = \sum_k f_k$, with $f_k \in L_\infty[\mu_k]$, then this is also true for the canonical
ensemble $Z_{\beta f}^{-1} e^{\beta f}\, d\mu$, corresponding to non-interaction of the subsystems.



3.1. Conditional expectations 

For $x \in \Omega$, $1 \le k \le n$ and $y \in \Omega_k$ we use $x_{y,k}$ to denote the vector in $\Omega$, which is obtained
by replacing $x_k$ with $y$. We also write, for $g \in L_\infty[\mu]$,

$$E_k[g](x) = \int_{\Omega_k} g(x_{y,k})\, d\mu_k(y) = \int_{\Omega_k} g(x_1, \ldots, x_{k-1}, y, x_{k+1}, \ldots, x_n)\, d\mu_k(y).$$

The operator $E_k$ corresponds to an expectation conditional to all variables with indices
different to $k$. We denote with $\mathcal{A}_k$ the sub-algebra of $L_\infty[\mu]$ consisting of those functions
that are independent of the $k$th variable. $E_k$ is evidently a linear projection onto $\mathcal{A}_k$.
Also, the $E_k$ commute amongst each other and, for $h \in L_\infty[\mu]$ and $g \in \mathcal{A}_k$, we have

$$E[[E_k h] g] = E[E_k[hg]] = E[hg]. \tag{3.1}$$

Replacing the operator $E$ by $E_k$ leads to the definition of conditional thermodynamic
quantities, all of which are now members of the algebra $\mathcal{A}_k$:

• the conditional partition function $Z_{k,\beta f} = E_k[e^{\beta f}]$,

• the conditional thermal expectation $E_{k,\beta f}[g] = Z_{k,\beta f}^{-1} E_k[g e^{\beta f}]$ for $g \in L_\infty[\mu]$,

• the conditional entropy $S_{k,f}(\beta) = \beta E_{k,\beta f}[f] - \ln Z_{k,\beta f}$,

• the conditional free energy $A_{k,f}(\beta) = \beta^{-1} \ln Z_{k,\beta f}$,

• the conditional thermal variance $\sigma^2_{k,\beta f}[g] = E_{k,\beta f}[(g - E_{k,\beta f}[g])^2]$ for $g \in L_\infty[\mu]$. As
$\beta \to 0$, this becomes

• the conditional variance $\sigma_k^2[g] = E_k[(g - E_k[g])^2]$ for $g \in L_\infty[\mu]$.

If we fix all variables except $x_k$, then $E_k$ just becomes an ordinary expectation, and
it becomes evident that all the previously established relations also hold for the
corresponding conditional quantities; in particular, the conclusions of Theorem 3,

$$S_{k,f}(\beta) = \int_0^\beta \int_t^\beta \sigma^2_{k,sf}[f]\, ds\, dt,$$

and of Lemma 4,

$$S_{k,f}(\beta) \le E_{k,\beta f}[\psi(-\beta(f - f_k))] \quad \text{if } f_k \in \mathcal{A}_k.$$

Other members of $\mathcal{A}_k$ that will play a role in the sequel are:

• the conditional supremum $(\sup_k g)(x) = \sup_{y \in \Omega_k} g(x_{y,k})$ for $g \in L_\infty[\mu]$,

• the conditional infimum $(\inf_k g)(x) = \inf_{y \in \Omega_k} g(x_{y,k})$ for $g \in L_\infty[\mu]$ and

• the conditional range $\mathrm{ran}_k(g) = \sup_k g - \inf_k g$ for $g \in L_\infty[\mu]$.

3.2. Tensorization of entropy 

In the non-interacting case, when the energy function $f$ is a sum, $f = \sum_k f_k$, with $f_k \in
L_\infty[\mu_k]$, it is easily verified that $S_{k,f}(\beta)(x) = S_{f_k}(\beta)$ is independent of $x$ and that

$$S_f(\beta) = \sum_{k=1}^n S_{k,f}(\beta).$$
Equality no longer holds in the interacting, nonlinear case, but there is a subadditivity 
property that is sufficient for the purpose of concentration inequalities. 

The tensorization inequality states that the total entropy is no greater than the thermal 
average of the sum of the conditional entropies. In 1975, Elliott Lieb [12] gave a proof of
this result, which was probably known some time before, at least in the classical setting
relevant to our arguments. 

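Lemma 5. Let $h, g > 0$ be bounded measurable functions on $\Omega$. Then, for any
expectation $E$,

$$E[h] \ln \frac{E[h]}{E[g]} \le E\left[h \ln \frac{h}{g}\right].$$

Proof. Define an expectation functional $E_g$ by $E_g[h] = E[gh]/E[g]$. The function $\Phi(t) =
t \ln t$ is convex for positive $t$, since $\Phi'' = 1/t > 0$. Thus, by Jensen's inequality,

$$E[h] \ln \frac{E[h]}{E[g]} = E[g]\, \Phi\left(E_g\left[\frac{h}{g}\right]\right) \le E[g]\, E_g\left[\Phi\left(\frac{h}{g}\right)\right] = E\left[h \ln \frac{h}{g}\right].$$

□

Theorem 6.

$$S_f(\beta) \le E_{\beta f}\left[\sum_{k=1}^n S_{k,f}(\beta)\right]. \tag{3.3}$$

Proof. We denote the canonical density with $\rho$, so $\rho = e^{\beta f}/Z_{\beta f}$. Writing $\rho = \rho/E[\rho]$ as
a telescopic product and using the previous lemma, we get

$$E\left[\rho \ln \frac{\rho}{E[\rho]}\right] = E\left[\rho \ln \prod_{k=1}^n \frac{E_1 \cdots E_{k-1}[\rho]}{E_1 \cdots E_{k-1} E_k[\rho]}\right] = \sum_{k=1}^n E\left[E_1 \cdots E_{k-1}[\rho] \ln \frac{E_1 \cdots E_{k-1}[\rho]}{E_1 \cdots E_{k-1}[E_k[\rho]]}\right] \le \sum_{k=1}^n E\left[\rho \ln \frac{\rho}{E_k[\rho]}\right].$$

From the definition of $\rho$, we then obtain

$$S_f(\beta) = \beta E_{\beta f}[f] - \ln Z_{\beta f} = E\left[\frac{e^{\beta f}}{Z_{\beta f}} \ln \frac{e^{\beta f}}{Z_{\beta f}}\right] = E\left[\rho \ln \frac{\rho}{E[\rho]}\right]$$

$$\le \sum_{k=1}^n E\left[E_k\left[\rho \ln \frac{\rho}{E_k[\rho]}\right]\right] = Z_{\beta f}^{-1} \sum_{k=1}^n E[E_k[e^{\beta f}]\, S_{k,f}(\beta)]$$

$$= Z_{\beta f}^{-1} \sum_{k=1}^n E[e^{\beta f} S_{k,f}(\beta)] \quad \text{by (3.1)}$$

$$= E_{\beta f}\left[\sum_{k=1}^n S_{k,f}(\beta)\right].$$

□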
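
The tensorization inequality $S_f(\beta) \le E_{\beta f}[\sum_k S_{k,f}(\beta)]$ can be illustrated by brute force on a small product space. The sketch below is a numerical illustration only; the three factor spaces, their marginals and the random function `f` are hypothetical choices.

```python
import itertools
import math
import random

random.seed(0)
# Hypothetical product space: three coordinates with marginals mu_k.
marginals = [[0.3, 0.7], [0.2, 0.5, 0.3], [0.6, 0.4]]
states = list(itertools.product(*[range(len(m)) for m in marginals]))
f = {x: random.uniform(-1.0, 1.0) for x in states}  # arbitrary bounded f

def prior(x):
    """Product measure mu(x) = prod_k mu_k(x_k)."""
    return math.prod(m[xk] for m, xk in zip(marginals, x))

def entropy_of(weighted, beta):
    """beta E_beta[v] - ln Z for a finite list of (weight, value) pairs."""
    z = sum(w * math.exp(beta * v) for w, v in weighted)
    mean = sum(w * v * math.exp(beta * v) for w, v in weighted) / z
    return beta * mean - math.log(z)

def conditional_entropy(k, x, beta):
    """S_{k,f}(beta)(x): only coordinate k varies, the others are frozen at x."""
    fibre = [(w, f[x[:k] + (y,) + x[k + 1:]]) for y, w in enumerate(marginals[k])]
    return entropy_of(fibre, beta)

def both_sides(beta):
    """Return (S_f(beta), E_{beta f}[sum_k S_{k,f}(beta)])."""
    total = entropy_of([(prior(x), f[x]) for x in states], beta)
    z = sum(prior(x) * math.exp(beta * f[x]) for x in states)
    avg = sum(prior(x) * math.exp(beta * f[x]) / z
              * sum(conditional_entropy(k, x, beta) for k in range(len(marginals)))
              for x in states)
    return total, avg

lhs, rhs = both_sides(1.7)
print(lhs, rhs)  # lhs should not exceed rhs
```

For an additive `f` the two sides should coincide, which is the non-interacting case discussed at the start of this subsection.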
3.3. The Efron-Stein-Steele inequality

Combining (3.3) with Theorem 3 and dividing by $\beta^2$, we obtain

$$\frac{1}{\beta^2} \int_0^\beta \int_t^\beta \sigma^2_{sf}[f]\, ds\, dt \le E_{\beta f}\left[\sum_{k=1}^n \frac{1}{\beta^2} \int_0^\beta \int_t^\beta \sigma^2_{k,sf}[f]\, ds\, dt\right].$$

Using the continuity properties of $\beta \mapsto \sigma^2_{\beta f}[f]$, which follow from Lemma 2, we can take
the limit as $\beta \to 0$ and multiply by 2 to obtain

$$\sigma^2[f] \le E\left[\sum_{k=1}^n \sigma_k^2[f]\right],$$

which is the well-known Efron-Stein-Steele inequality [21]. Observe that we may drop
the assumption $f \in L_\infty[\mu]$, but we still require the existence of exponential moments
in an interval containing zero, so the inequality so derived is slightly weaker than the
inequality in [21].
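
The Efron-Stein-Steele inequality $\sigma^2[f] \le E[\sum_k \sigma_k^2[f]]$ is easy to test by enumeration. The following sketch assumes a hypothetical three-coordinate product space and a random function; it illustrates the statement only and plays no role in the derivation above.

```python
import itertools
import math
import random

random.seed(3)
# Hypothetical marginals mu_k of three independent coordinates.
marginals = [[0.5, 0.5], [0.1, 0.9], [0.3, 0.3, 0.4]]
states = list(itertools.product(*[range(len(m)) for m in marginals]))
f = {x: random.gauss(0.0, 1.0) for x in states}

def prior(x):
    """Product measure mu(x) = prod_k mu_k(x_k)."""
    return math.prod(m[xk] for m, xk in zip(marginals, x))

def variance(weighted):
    """Variance of a finite distribution given as (weight, value) pairs."""
    mean = sum(w * v for w, v in weighted)
    return sum(w * (v - mean) ** 2 for w, v in weighted)

def conditional_variance(k, x):
    """sigma_k^2[f](x): variance over coordinate k with the others fixed."""
    return variance([(w, f[x[:k] + (y,) + x[k + 1:]])
                     for y, w in enumerate(marginals[k])])

total_var = variance([(prior(x), f[x]) for x in states])
efron_stein = sum(prior(x) * sum(conditional_variance(k, x)
                                 for k in range(len(marginals)))
                  for x in states)
print(total_var, efron_stein)  # total_var should not exceed efron_stein
```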

3.4. A modified logarithmic Sobolev inequality 

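Suppose we have a sequence of functions $f_k \in \mathcal{A}_k$, so that $f_k$ is independent of the $k$th
coordinate. Combining (3.3) with Lemma 4 and using the identity $E_{\beta f} E_{k,\beta f} = E_{\beta f}$, we
obtain

$$S_f(\beta) \le E_{\beta f}\left[\sum_{k=1}^n \psi(-\beta(f - f_k))\right], \tag{3.4}$$

which is the modified logarithmic Sobolev inequality proposed by Massart [13, 14]. Many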
consequences of this powerful inequality have been explored (e.g., [3-5, 13, 16]). Here we 
will concentrate on the consequences of combining the tensorization inequality with the 
fluctuation representation of entropy in Theorem 3. Since the fluctuation representation 
is an identity, this combination is stronger than (3.4) and leads to some results that 
apparently cannot be recovered from (3.4). We will also re-derive some results that can 
be derived from (3.4) in cases where we believe that the proposed method gives some 
additional insight. 

3.5. Conditional thermal variance and exponential concentration 

Theorems 1, 6 and 3 (applied to the conditional entropy) form the backbone of the 
proposed method. Combining them, we obtain the following generic concentration result: 

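Theorem 7. For any $\beta > 0$ we have the entropy bound

$$S_f(\beta) \le E_{\beta f}\left[\sum_{k=1}^n \int_0^\beta \int_t^\beta \sigma^2_{k,sf}[f]\, ds\, dt\right],$$

the bound on the log-Laplace transform

$$\ln E[e^{\beta(f - Ef)}] \le \beta \int_0^\beta \frac{1}{\gamma^2} E_{\gamma f}\left[\sum_{k=1}^n \int_0^\gamma \int_t^\gamma \sigma^2_{k,sf}[f]\, ds\, dt\right] d\gamma$$

and the concentration inequality

$$\Pr\{f - Ef > t\} \le \exp\left(\beta \int_0^\beta \frac{1}{\gamma^2} E_{\gamma f}\left[\sum_{k=1}^n \int_0^\gamma \int_t^\gamma \sigma^2_{k,sf}[f]\, ds\, dt\right] d\gamma - \beta t\right).$$

The obvious strategy is to start by bounding the conditional thermal variance $\sigma^2_{k,sf}[f]$.
Typically, this leads to considerable simplifications and we will follow this method in the
sequel.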



4. Two classical concentration inequalities 

We begin with the derivation of two classical results: the bounded difference inequality 
and a similar result, which reduces to the familiar Bennett inequality when / is the sum 
of its arguments. These inequalities are not new, but they are very useful. We obtain them 
in their strongest forms and they provide a good illustration of our proposed method. 
For $a, b \in \mathbb{R}$, $a < b$, define $\zeta_{a,b} : \mathbb{R} \to \mathbb{R}$ by

$$\zeta_{a,b}(t) = (b - t)(t - a).$$

We state some elementary facts without proof.

Lemma 8. (i) If $X$ is a random variable with values in $[a, b]$, then

$$\sigma^2[X] \le (b - EX)(EX - a) = \zeta_{a,b}(EX) \le \frac{(b - a)^2}{4}.$$

(ii) The function $\zeta_{a,b}$ is non-increasing in $[(a + b)/2, \infty)$.



4.1. The bounded difference inequality 

By Lemma 8(i), we get for all $s \in \mathbb{R}$ that $\sigma^2_{k,sf}[f] \le \mathrm{ran}_k(f)^2/4$, so by the first conclusion
of Theorem 7,

$$S_f(\gamma) \le E_{\gamma f}\left[\sum_{k=1}^n \int_0^\gamma \int_t^\gamma \sigma^2_{k,sf}[f]\, ds\, dt\right] \le \frac{\gamma^2}{8} E_{\gamma f}[R^2(f)], \tag{4.1}$$

where we introduced the abbreviation $R^2(f) := \sum_{k=1}^n \mathrm{ran}_k(f)^2$. Bounding the thermal
expectation by the uniform norm, we obtain from the third conclusion of Theorem 7
that for all $\beta > 0$

$$\Pr\{f - Ef > t\} \le \exp\left(\beta \int_0^\beta \frac{S_f(\gamma)}{\gamma^2}\, d\gamma - \beta t\right) \le \exp\left(\frac{\beta^2}{8} \|R^2(f)\|_\infty - \beta t\right).$$

Substitution of the minimizing value $\beta = 4t/\|R^2(f)\|_\infty$ gives

$$\Pr\{f - Ef > t\} \le \exp\left(\frac{-2t^2}{\|R^2(f)\|_\infty}\right),$$

which is the well-known bounded difference inequality (with correct exponent) in the
strong version given by McDiarmid [19], Theorem 3.7, where the supremum is outside of
the sum of squared conditional ranges. Note that the result is vacuous for $f \notin L_\infty[\mu]$.
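
To make the quantity $\|R^2(f)\|_\infty$ concrete, the sketch below computes it by enumeration for a hypothetical example, the number of changes between consecutive bits in ten fair coin flips, and compares the exact upper tail with the bound $\exp(-2t^2/\|R^2(f)\|_\infty)$. The function and all parameters are illustrative assumptions.

```python
import itertools
import math

n, p = 10, 0.5  # hypothetical: n independent fair coin flips
states = list(itertools.product((0, 1), repeat=n))

def f(x):
    """Number of positions where consecutive bits differ."""
    return sum(x[i] != x[i + 1] for i in range(n - 1))

def prob(x):
    """Product measure of the bit string x."""
    return math.prod(p if b else 1 - p for b in x)

def cond_range(k, x):
    """ran_k(f)(x) = sup_y f(x_{y,k}) - inf_y f(x_{y,k})."""
    vals = [f(x[:k] + (y,) + x[k + 1:]) for y in (0, 1)]
    return max(vals) - min(vals)

# ||R^2(f)||_inf: supremum over x of the sum of squared conditional ranges.
r2_sup = max(sum(cond_range(k, x) ** 2 for k in range(n)) for x in states)
mean = sum(prob(x) * f(x) for x in states)

def exact_tail(t):
    """Pr{f - E f > t}, by enumeration."""
    return sum(prob(x) for x in states if f(x) - mean > t)

def bound(t):
    """Bounded difference bound exp(-2 t^2 / ||R^2(f)||_inf)."""
    return math.exp(-2 * t * t / r2_sup)

for t in (1.0, 2.0, 3.0):
    print(t, exact_tail(t), bound(t))
```

In this example the supremum sits outside the sum, as in the strong version discussed above, although here it happens to coincide with the sum of the individual suprema.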

4.2. A Bennett-Bernstein concentration inequality

The proof of the bounded difference inequality relied on bounding the thermal
variance $\sigma^2_{k,\beta f}(f)$ uniformly in $\beta$, using constraints on the conditional range of $f$. We now
consider the case where we only use one constraint on the ranges, say $f - E_k[f] \le 1$, but
we use information on the conditional variances. This leads to a Bennett-type inequality
as in [19], Theorem 3.8. To state it, we abbreviate the sum of conditional variances of $f$
as

$$\Sigma^2(f) = \sum_{k=1}^n \sigma_k^2(f).$$

Again, we start with a bound on the thermal variance.

Lemma 9. Assume $f - Ef \le 1$. Then, for $\beta > 0$,

$$\sigma^2_{\beta f}(f) \le e^\beta \sigma^2(f).$$

Proof.

$$\sigma^2_{\beta f}(f) = \sigma^2_{\beta(f - Ef)}(f - Ef) = E_{\beta(f - Ef)}[(f - Ef)^2] - (E_{\beta(f - Ef)}[f - Ef])^2$$

$$\le E_{\beta(f - Ef)}[(f - Ef)^2] = \frac{E[(f - Ef)^2 e^{\beta(f - Ef)}]}{E[e^{\beta(f - Ef)}]}$$

$$\le E[(f - Ef)^2 e^{\beta(f - Ef)}] \quad \text{(use Jensen on denominator)}$$

$$\le e^\beta E[(f - Ef)^2] \quad \text{(use hypothesis).}$$

□

Next we bound the total entropy $S_f(\beta)$.

Lemma 10. Assume that $f - E_k f \le 1$ for all $k \in \{1, \ldots, n\}$. Then, for $\beta > 0$,

$$S_f(\beta) \le (\beta e^\beta - e^\beta + 1) E_{\beta f}[\Sigma^2(f)].$$

Proof. Using the first conclusion of Theorem 7 and the previous lemma, we get

$$S_f(\beta) \le E_{\beta f}\left[\sum_{k=1}^n \int_0^\beta \int_t^\beta \sigma^2_{k,sf}[f]\, ds\, dt\right] \le \left(\int_0^\beta \int_t^\beta e^s\, ds\, dt\right) E_{\beta f}[\Sigma^2(f)].$$

The conclusion follows from the elementary formula

$$\int_0^\beta \int_t^\beta e^s\, ds\, dt = \int_0^\beta (e^\beta - e^t)\, dt = \beta e^\beta - e^\beta + 1.$$

□

Now we can prove our version of Bennett's inequality.

Theorem 11. Assume $f - E_k f \le 1$, $\forall k$. Let $t > 0$ and denote $V = \|\Sigma^2(f)\|_\infty$. Then

$$\Pr\{f - E[f] > t\} \le \exp(-V((1 + tV^{-1}) \ln(1 + tV^{-1}) - tV^{-1})) \le \exp\left(\frac{-t^2}{2V + 2t/3}\right).$$

Proof. Fix $\beta > 0$. Recall the definition of the function $\psi$ in (2.2) and observe that

$$\int_0^\beta \frac{\gamma e^\gamma - e^\gamma + 1}{\gamma^2}\, d\gamma = \beta^{-1}(e^\beta - \beta - 1) = \beta^{-1} \psi(\beta),$$

because $(d/d\gamma)(\gamma^{-1}(e^\gamma - 1)) = \gamma^{-2}(\gamma e^\gamma - e^\gamma + 1)$ and $\lim_{\gamma \to 0} \gamma^{-1}(e^\gamma - 1) = 1$. Theorem 7
and Lemma 10 combined with a uniform bound then give

$$\ln E[e^{\beta(f - Ef)}] \le \beta \int_0^\beta \frac{\gamma e^\gamma - e^\gamma + 1}{\gamma^2}\, d\gamma\, \|\Sigma^2(f)\|_\infty = \psi(\beta) V.$$

So, by Markov's inequality, we have for any $\beta > 0$ that $\Pr\{f - E[f] > t\} \le \exp(\psi(\beta) V -
\beta t)$. Substitution of $\beta = \ln(1 + tV^{-1})$ gives the first inequality; the second is Lemma 2.4
in [19]. □

Observe that $f$ is assumed bounded above by the hypotheses of the theorem. The
existence of exponential moments $E[e^{\beta f}]$ is needed only for $\beta \ge 0$, so the assumption
$f \in L_\infty[\mu]$ can be dropped in this case.
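
The optimization step in the proof of Theorem 11 can be checked numerically. The sketch below evaluates the exponent $\psi(\beta)V - \beta t$ at $\beta = \ln(1 + tV^{-1})$ and compares it with the Bennett and Bernstein exponents; the sample values of $t$ and $V$ are arbitrary assumptions.

```python
import math

def psi(t):
    """psi(t) = e^t - t - 1, as in (2.2)."""
    return math.expm1(t) - t

def chernoff_exponent(beta, t, v):
    """psi(beta) V - beta t, the exponent before optimising in beta."""
    return psi(beta) * v - beta * t

def bennett_exponent(t, v):
    """-V((1 + t/V) ln(1 + t/V) - t/V)."""
    u = t / v
    return -v * ((1 + u) * math.log1p(u) - u)

def bernstein_exponent(t, v):
    """-t^2 / (2V + 2t/3)."""
    return -t * t / (2 * v + 2 * t / 3)

for t, v in [(0.5, 1.0), (2.0, 0.3), (10.0, 4.0)]:
    beta_star = math.log1p(t / v)  # the substitution used in the proof
    print(t, v, chernoff_exponent(beta_star, t, v),
          bennett_exponent(t, v), bernstein_exponent(t, v))
```

The first two printed exponents should coincide, and both should lie below the Bernstein exponent, in line with the two inequalities of the theorem.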



5. Exploiting monotonicity 



Sometimes an appropriately chosen bound on the conditional thermal variance $\sigma^2_{k,sf}[f]$
can be shown to have a monotonicity property in the variable $s$, which can be used to find
a bound uniform in the region of integration. The remaining part of the fluctuation
integral then just becomes $\beta^2/2$, which leads to sub-Gaussian tail estimates, just as for
the bounded difference inequality. In this section, we give three examples.



5.1. Functions with large conditional expectations 

The following is our first novel result, the proof of which is hardly more difficult than that
of the bounded difference inequality. It depends on the assumption that the conditional
expectations are consistently in the upper halves of the conditional ranges for all $k$ and
all configurations $x_1, \ldots, x_{k-1}, x_{k+1}, \ldots, x_n$ of the conditioning data. If this condition is
met, the result is much stronger than the bounded difference inequality, and, for large
deviations $t$, also much stronger than Bennett's inequality.

Theorem 12. Suppose that

$$E_k[f] \ge \frac{\sup_k f + \inf_k f}{2} \quad \forall k \in \{1, \ldots, n\} \tag{5.1}$$

and let

$$A = \left\|\sum_{k=1}^n (\sup_k f - E_k[f])(E_k[f] - \inf_k f)\right\|_\infty.$$

Then

$$\Pr\{f - Ef > t\} \le e^{-t^2/(2A)}.$$

Proof. By Lemma 2, the function $\beta \mapsto E_{k,\beta f}[f]$ is non-decreasing, so for $\beta \ge 0$ we have

$$E_k[f] \in \left[\frac{\sup_k f + \inf_k f}{2}, E_{k,\beta f}[f]\right].$$

Since the function $\zeta_{\inf_k f, \sup_k f}$ (of Lemma 8) is non-increasing in this interval, we get

$$\sigma^2_{k,\beta f}(f) \le \zeta_{\inf_k f, \sup_k f}(E_{k,\beta f}[f]) \le \zeta_{\inf_k f, \sup_k f}(E_k[f]),$$

and, from the first conclusion of Theorem 7,

$$S_f(\beta) \le E_{\beta f}\left[\sum_{k=1}^n \int_0^\beta \int_t^\beta \sigma^2_{k,sf}[f]\, ds\, dt\right] \le \frac{\beta^2}{2} E_{\beta f}\left[\sum_{k=1}^n \zeta_{\inf_k f, \sup_k f}(E_k[f])\right] \le \frac{\beta^2 A}{2}.$$

The result now follows as in the proof of the bounded difference inequality.

□



14 A. Maurer 

5.2. Monotonicity of variational bounds on the thermal variance 

A related strategy first finds a simple variational bound on the conditional thermal 
variance. We have 

$$\sigma^2[g] = \min_{t \in \mathbb{R}} E[(g - t)^2] \le E[(g - c)^2] \quad \forall c \in \mathbb{R}.$$

Applied to the conditional thermal variance, this translates to

$$\sigma^2_{k,\beta f}[f] \le E_{k,\beta f}[(f - f_k)^2] \quad \forall f_k \in \mathcal{A}_k. \tag{5.2}$$

We will use $\inf_k f$ for $f_k$ and combine this observation with the following.

Proposition 13. The function $\beta \mapsto E_{k,\beta f}[(f - \inf_k f)^2]$ is non-decreasing.

Proof. Write $h = f - \inf_k f$ and define a real function $\xi$ by $\xi(t) = (\max\{t, 0\})^2$. Since
$h \ge 0$, we have

$$E_{k,\beta f}[(f - \inf_k f)^2] = E_{k,\beta(f - \inf_k f)}[(f - \inf_k f)^2] = E_{k,\beta h}[\xi(h)].$$

By Lemma 2, we obtain

$$\frac{d}{d\beta} E_{k,\beta h}[\xi(h)] = E_{k,\beta h}[\xi(h) h] - E_{k,\beta h}[\xi(h)] E_{k,\beta h}[h] \ge 0,$$

where the last inequality uses the well-known fact that for any expectation $E[\xi(h) h] \ge
E[\xi(h)] E[h]$ whenever $\xi$ is a non-decreasing function. □

A first consequence is a lower tail bound somewhat similar to Bernstein's inequality, 
Theorem 11. 

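Theorem 14. Let $t > 0$ and denote

$$W = \left\|\sum_{k=1}^n E_k[(f - \inf_k f)^2]\right\|_\infty.$$

Then

$$\Pr\{E[f] - f > t\} \le \exp\left(\frac{-t^2}{2W}\right).$$

Proof. We use inequality (5.2) and Proposition 13 to get for $s \ge 0$

$$\sigma^2_{k,-sf}[f] \le E_{k,-sf}[(f - \inf_k f)^2] \le E_k[(f - \inf_k f)^2].$$

We therefore obtain from Theorem 7

$$S_{-f}(\beta) \le E_{-\beta f}\left[\sum_{k=1}^n \int_0^\beta \int_t^\beta \sigma^2_{k,-sf}[f]\, ds\, dt\right] \le \frac{\beta^2}{2} E_{-\beta f}\left[\sum_{k=1}^n E_k[(f - \inf_k f)^2]\right] \le \frac{\beta^2 W}{2}$$

and then proceed as in the proof of the bounded difference inequality.

□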



If we take the function / to be an average of real random variables, then Theorem 14 
reduces to an inequality given in [20] and [15]. In [15] it is argued that for very hetero- 
geneous variables this inequality is superior to Bernstein's inequality. Similar arguments 
apply to the present, more general case. 

When we apply the same method to obtain upper tail bounds we arrive at a surprisingly 
powerful result. To state it, we introduce worst-case variance proxies, which will play an 
important role in the sequel. 

Definition 1. Let $g \in L_\infty[\mu]$. The worst-case variance proxy of $g$ is the function $Dg \in
L_\infty[\mu]$ defined by

$$Dg = \sum_{k=1}^n (g - \inf_k g)^2.$$

The function Dg is a local measure of the sensitivity of g to modifications of its 
individual arguments. 

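Lemma 15. We have, for $\beta > 0$,

$$S_f(\beta) \le \frac{\beta^2}{2} E_{\beta f}[Df].$$

Proof. We use inequality (5.2) and Proposition 13 to get for $0 \le s \le \beta$

$$\sigma^2_{k,sf}[f] \le E_{k,sf}[(f - \inf_k f)^2] \le E_{k,\beta f}[(f - \inf_k f)^2].$$

So, using Theorem 7 again,

$$S_f(\beta) \le E_{\beta f}\left[\sum_{k=1}^n \int_0^\beta \int_t^\beta \sigma^2_{k,sf}[f]\, ds\, dt\right] \le \frac{\beta^2}{2} E_{\beta f}\left[\sum_{k=1}^n E_{k,\beta f}[(f - \inf_k f)^2]\right] = \frac{\beta^2}{2} E_{\beta f}\left[\sum_{k=1}^n (f - \inf_k f)^2\right],$$

where we used the identity $E_{\beta f} E_{k,\beta f} = E_{\beta f}$ in the last equation.

□

The usual arguments now immediately lead to the following.

Theorem 16. With $t > 0$,

$$\Pr\{f - E[f] > t\} \le \exp\left(\frac{-t^2}{2\|Df\|_\infty}\right).$$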

In [16] this result was derived from inequality (3.4), and it is shown that it improves 
the exponent on upper tail bounds derived from Talagrand's convex distance inequality 
in many cases, as for shortest travelling salesmen paths, Steiner trees and the eigenvalues
of random symmetric matrices. Here we only give one example of how $\|Df\|_\infty$ may be
bounded and consider a convex Lipschitz function $f$ defined on the cube $[0, 1]^n$. For
simplicity, we assume $f$ to be differentiable.

Let $x \in [0, 1]^n$ and suppose that for some fixed $k$ there is $y \in [0, 1]$ such that $f(x_{y,k}) \le
f(x)$. Then by convexity (using really only the fact that $f$ is separately convex in each
coordinate),

$$f(x) - f(x_{y,k}) \le \langle x - x_{y,k}, \partial f(x) \rangle_{\mathbb{R}^n} = (x_k - y)\, \partial_k f(x) \le |\partial_k f(x)|.$$

We therefore have $f(x) - \inf_y f(x_{y,k}) \le |\partial_k f(x)|$ and

$$Df(x) = \sum_{k=1}^n \left(f(x) - \inf_y f(x_{y,k})\right)^2 \le \|\partial f(x)\|^2_{\mathbb{R}^n} \le \|f\|^2_{\mathrm{Lip}}.$$

In combination with Theorem 16 we obtain upper tail bounds for / with an exponent 
twice as good as obtained from the convex distance inequality [11], Corollary 4.10, or an 
earlier application of the entropy method [11], Theorem 5.9. 
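
As a concrete instance of the bound $Df(x) \le \|\partial f(x)\|^2 \le \|f\|^2_{\mathrm{Lip}}$, the sketch below takes the Euclidean norm on $[0, 1]^n$, which is convex and 1-Lipschitz, and evaluates $Df$ at random points. The choice of function, the dimension and the sample size are assumptions made for illustration; for this $f$ the conditional infimum is attained at $y = 0$.

```python
import math
import random

def f(x):
    """Euclidean norm: convex and 1-Lipschitz on the cube."""
    return math.sqrt(sum(v * v for v in x))

def worst_case_variance_proxy(x):
    """Df(x) = sum_k (f(x) - inf_y f(x_{y,k}))^2.
    For the Euclidean norm the infimum over y in [0, 1] is attained at y = 0."""
    fx = f(x)
    return sum((fx - f(x[:k] + [0.0] + x[k + 1:])) ** 2 for k in range(len(x)))

def gradient_norm_sq(x):
    """Squared norm of the gradient x / |x| (equal to 1 away from the origin)."""
    fx = f(x)
    return sum((v / fx) ** 2 for v in x)

random.seed(1)
points = [[random.random() for _ in range(20)] for _ in range(200)]
worst = max(worst_case_variance_proxy(x) for x in points)
print(worst)  # should not exceed the squared Lipschitz constant 1
```

With $\|Df\|_\infty \le 1$, Theorem 16 would give the upper tail bound $\exp(-t^2/2)$ for this function.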

For a corresponding lower tail bound, we have to use an estimate similar to what was 
used in the proof of Bennett's inequality. 

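Lemma 17. If $f - \inf_k f \le 1$, $\forall k$, then for $\beta > 0$,

$$S_{-f}(\beta) \le \psi(\beta) E_{-\beta f}[Df],$$

with $\psi$ defined as in (2.2).

Proof. Let $k \in \{1, \ldots, n\}$. We write $h_k := f - \inf_k f$. Then $h_k \in [0, 1]$ and for $s \le \beta$

$$E_{k,-s h_k}[h_k^2] = \frac{E_k[h_k^2 e^{(\beta - s) h_k} e^{-\beta h_k}]}{E_k[e^{(\beta - s) h_k} e^{-\beta h_k}]} \le e^{\beta - s} \frac{E_k[h_k^2 e^{-\beta h_k}]}{E_k[e^{-\beta h_k}]} = e^{\beta - s} E_{k,-\beta h_k}[h_k^2].$$

We therefore have

$$\int_0^\beta \int_t^\beta E_{k,-sf}[h_k^2]\, ds\, dt = \int_0^\beta \int_t^\beta E_{k,-s h_k}[h_k^2]\, ds\, dt \le \left(\int_0^\beta \int_t^\beta e^{\beta - s}\, ds\, dt\right) E_{k,-\beta h_k}[h_k^2] = \psi(\beta) E_{k,-\beta f}[h_k^2],$$

where we used the formula

$$\int_0^\beta \int_t^\beta e^{-s}\, ds\, dt = 1 - e^{-\beta} - \beta e^{-\beta}.$$

Thus, using Theorem 7 and the identity $E_{-\beta f} E_{k,-\beta f} = E_{-\beta f}$,

$$S_{-f}(\beta) \le E_{-\beta f}\left[\sum_{k=1}^n \int_0^\beta \int_t^\beta \sigma^2_{k,-sf}[f]\, ds\, dt\right] \le E_{-\beta f}\left[\sum_{k=1}^n \int_0^\beta \int_t^\beta E_{k,-sf}[h_k^2]\, ds\, dt\right] \le \psi(\beta) E_{-\beta f}\left[\sum_{k=1}^n E_{k,-\beta f}[h_k^2]\right] = \psi(\beta) E_{-\beta f}[Df].$$

□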

Lemmas 15 and 17 together with Theorem 1 imply the inequalities

$$\ln E[e^{\beta(f - E[f])}] \le \frac{\beta}{2} \int_0^\beta E_{\gamma f}[Df]\, d\gamma \tag{5.3}$$

and, if $f - \inf_k f \le 1$ for all $k$, then

$$\ln E[e^{\beta(E[f] - f)}] \le \frac{\psi(\beta)}{\beta} \int_0^\beta E_{-\gamma f}[Df]\, d\gamma, \tag{5.4}$$

where in the last inequality we also used the fact that $\gamma \mapsto \psi(\gamma)/\gamma^2$ is non-decreasing.
Bounding the thermal expectation with the uniform norm and substitution of $\beta = \ln(1 +
t\|Df\|_\infty^{-1})$ gives the following lower tail bound that can also be found in [16].



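Theorem 18. If $f - \inf_k f \le 1$ for all $k$, then for $t > 0$,

$$\Pr\{Ef - f > t\} \le \exp\left(-\|Df\|_\infty \left(\left(1 + \frac{t}{\|Df\|_\infty}\right) \ln\left(1 + \frac{t}{\|Df\|_\infty}\right) - \frac{t}{\|Df\|_\infty}\right)\right) \le \exp\left(\frac{-t^2}{2\|Df\|_\infty + 2t/3}\right).$$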


The two inequalities (5.3) and (5.4) are the keys to obtaining concentration inequalities 
in terms of the worst-case variance proxy Df . Both results can also be deduced from 
Massart's inequality (3.4) as shown in [16]. We do not claim that the derivations given 
above are per se superior. We presented them because they follow the same principles 
as the proofs of the other results given above (the bounded difference inequality and 
Theorems 11, 12 and 14), which do not follow from inequality (3.4). 



6. Self-boundedness and canonical decoupling 

We conclude by presenting two general principles to extend the utility of the proposed
method. All the above applications of Theorem 7 involved a chain of inequalities of the
form

$$S_{\epsilon f}(\gamma) \le E_{\epsilon \gamma f}\left[\sum_{k=1}^n \int_0^\gamma \int_t^\gamma \sigma^2_{k,\epsilon s f}[f]\, ds\, dt\right] \le \xi(\gamma) E_{\epsilon \gamma f}[G(f)],$$

where $\epsilon = 1$ for upper tail results and $\epsilon = -1$ for lower tail results, $\xi$ is some non-negative
real function and $G(f)$ is some function on $\Omega$ derived from $f$. For the bounded difference
inequality, for example, $\xi(\gamma) = \gamma^2/8$ and $G = R^2(f)$; for the Bennett inequality $\xi(\gamma) =
\gamma e^\gamma - e^\gamma + 1$ and $G(f) = \Sigma^2(f)$; for Theorem 16 we had $\xi(\gamma) = \gamma^2/2$ and $G(f) = Df$;
while for the corresponding lower tail bound, Theorem 18, we had $\xi(\gamma) = \psi(\gamma)$ and also
$G(f) = Df$, etc. Theorem 7 is then invoked to conclude that

$$\ln E[e^{\epsilon \beta (f - Ef)}] \le \beta \int_0^\beta \frac{\xi(\gamma) E_{\epsilon \gamma f}[G(f)]}{\gamma^2}\, d\gamma \le \beta \|G(f)\|_\infty \int_0^\beta \frac{\xi(\gamma)}{\gamma^2}\, d\gamma. \tag{6.1}$$

Here the uniform estimate $E_{\epsilon \beta f}[G(f)] \le \|G(f)\|_\infty$, while being very simple, is somewhat
loose. We now sketch how it can sometimes be avoided by exploiting special properties
of the thermal expectation.

6.1. Self-boundedness 

The first possibility we consider is that the function $G(f)$ can be bounded in terms
of the function $f$ itself, a property referred to as self-boundedness [4]. For example, if
$G(f) \le f$, then $E_{\gamma f}[G(f)] \le E_{\gamma f}[f] = (d/d\gamma) \ln E[\exp(\gamma f)]$, and if the function $\xi$ has some
reasonable behavior, then the first integral in (6.1) above can be bounded by partial
integration or even more easily. As an example, we apply this idea in the setting of
Theorems 16 and 18.

Theorem 19. Suppose that there are non-negative numbers a, b such that Df < af + b. 
Then, for t>0, we have 



Pr{/ - E[f] >t}< exp 



-t z 



2{aE[f] + b + at/2) 
If in addition, a > 1 and f — inffc / < l,Vfc G {1, . . . ,n}, then 

-t 2 



Pr{£[/] - / > t} < exp 



2(aE[f] + b) 



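Proof. We only prove the lower tail bound; for the upper tail we refer to [16]. As for
the lower tail, it follows from (5.4) and Lemma 2 that

$$\ln E\big[e^{\beta(E[f]-f)}\big] \le \frac{a\psi(\beta)}{\beta} \int_0^\beta E_{-\gamma f}[f]\,d\gamma + \psi(\beta)\, b = -\frac{a\psi(\beta)}{\beta} \ln Z_{-\beta f} + \psi(\beta)\, b$$

$$= -\frac{a\psi(\beta)}{\beta} \ln E\big[e^{\beta(E[f]-f)}\big] + \psi(\beta)\,(aE[f] + b).$$

Rearranging gives

$$\ln E\big[e^{\beta(E[f]-f)}\big] \le \frac{\psi(\beta)}{1 + a\beta^{-1}\psi(\beta)}\,(aE[f] + b) \le \frac{\beta^2}{2}\,(aE[f] + b),$$

where one verifies that for $\beta > 0$ and $a \ge 1$ we have $\psi(\beta)(1 + a\beta^{-1}\psi(\beta))^{-1} \le \beta^2/2$. The
usual analysis with Markov's inequality and optimization in $\beta$ concludes the proof. □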
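
The elementary inequality $\psi(\beta)(1 + a\beta^{-1}\psi(\beta))^{-1} \le \beta^2/2$ used at the end of the proof can be checked numerically. The sketch below assumes $\psi(\gamma) = e^{\gamma} - \gamma - 1$; this form of $\psi$ is not restated in this section and is an assumption here. The grid of $\beta$ and $a$ values and all function names are illustrative choices, and a grid check is of course not a proof.

```python
import math


def psi(gamma: float) -> float:
    """ASSUMED form psi(gamma) = e^gamma - gamma - 1 (not restated in this section)."""
    # expm1 avoids cancellation for small gamma.
    return math.expm1(gamma) - gamma


def rearranged_factor(beta: float, a: float) -> float:
    """The factor psi(beta) * (1 + a * psi(beta) / beta)^(-1) from the proof."""
    p = psi(beta)
    return p / (1.0 + a * p / beta)


def inequality_holds(betas, a_values) -> bool:
    """True if factor <= beta^2 / 2 at every grid point (numerical check only)."""
    return all(
        rearranged_factor(b, a) <= 0.5 * b * b
        for b in betas
        for a in a_values
    )


# Illustrative grid: beta in (0, 30], a few values of a >= 1.
betas = [0.01 * k for k in range(1, 3001)]
a_values = [1.0, 1.5, 2.0, 10.0]
print(inequality_holds(betas, a_values))

# For comparison, a value a < 1 at a small beta.
print(inequality_holds([0.1], [0.5]))
```

Under the assumed $\psi$, the second call shows the inequality failing for $a = 0.5$ at $\beta = 0.1$, so this particular verification step does use a lower bound on $a$.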

Recently Boucheron et al. [4] have given a refined version of this result, where the
condition $a \ge 1$ is improved to $a \ge 1/3$ for the lower tail. There they also show that
Theorems 19 and 16 together suffice to derive a version of the convex distance inequality 
that differs from Talagrand's original result only in that it has an inferior exponent. 

It must be stressed that the same method of proof can be used to yield self-bounded 
versions of all concentration inequalities derived from Theorem 7, such as the bounded- 
difference and Bennett inequalities. 

6.2. Decoupling 

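A second method to avoid the uniform bound on the thermal expectation uses decoupling.
Recall that for any two probability measures $\nu$ and $\mu$ and a measurable function $g$ we
have

$$E_\nu[g] \le KL(\nu, \mu) + \ln E_\mu[e^g],$$

which can be regarded as an instance of convex duality and easily verified directly from
the definition of the Kullback-Leibler divergence. Applying this inequality when $\nu$ is the
canonical ensemble and $\mu$ is the a priori measure, we obtain for any $\theta > 0$

$$S_{\varepsilon f}(\beta) \le \xi(\beta)\,\theta^{-1} E_{\varepsilon\beta f}[\theta G(f)] \le \xi(\beta)\,\theta^{-1} \big(S_{\varepsilon f}(\beta) + \ln E[\exp(\theta G(f))]\big).$$

For values of $\beta$ and $\theta$ where $\theta > \xi(\beta)$ we obtain

$$S_{\varepsilon f}(\beta) \le \frac{\xi(\beta)}{\theta - \xi(\beta)} \ln E[\exp(\theta G(f))].$$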

Hence, if we can control the upwards deviations of G(f) (or some suitable bound thereof), 
we obtain concentration inequalities for / in terms of the expectation of G(f) (or the 
bound thereof). Again, this method, which was proposed in [3], can be applied to all 
the versions of G(f) we introduced above and combined with all methods to control the 
upwards deviation of G(f), which leads to a proliferation of concentration inequalities. 
Perhaps not all of these deserve to be documented. We just quote a corresponding result 
in [17] that uses $G(f) = Df$ and combines with self-boundedness.
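
The two facts behind the decoupling step can be checked on a small finite probability space: the duality inequality $E_\nu[g] \le KL(\nu, \mu) + \ln E_\mu[e^g]$, and the identification of $KL(\nu, \mu)$ with the canonical entropy $S_f(\beta) = \beta E_{\beta f}[f] - \ln Z_{\beta f}$ when $\nu$ is the canonical ensemble. The following is a minimal sketch; the four-point space, the values of `f` and `G`, and all function names are hypothetical choices made for illustration.

```python
import math


def thermal_quantities(mu, f, beta):
    """Return (Z, canonical ensemble nu, canonical entropy S) on a finite space."""
    weights = [m * math.exp(beta * fx) for m, fx in zip(mu, f)]
    Z = sum(weights)                      # partition function E[e^{beta f}]
    nu = [w / Z for w in weights]         # canonical ensemble
    thermal_mean_f = sum(n * fx for n, fx in zip(nu, f))
    S = beta * thermal_mean_f - math.log(Z)
    return Z, nu, S


def kl(nu, mu):
    """Kullback-Leibler divergence of nu with respect to mu (finite space)."""
    return sum(n * math.log(n / m) for n, m in zip(nu, mu) if n > 0)


def duality_gap(nu, mu, g):
    """KL(nu, mu) + ln E_mu[e^g] - E_nu[g]; non-negative by the duality inequality."""
    lhs = sum(n * gx for n, gx in zip(nu, g))
    rhs = kl(nu, mu) + math.log(sum(m * math.exp(gx) for m, gx in zip(mu, g)))
    return rhs - lhs


# Hypothetical example data (not from the text).
mu = [0.1, 0.2, 0.3, 0.4]      # a priori measure
f = [1.0, -0.5, 2.0, 0.3]      # function under investigation
G = [0.5, 1.5, 0.2, 1.0]       # stand-in for G(f)
beta, theta = 0.7, 2.0

Z, nu, S = thermal_quantities(mu, f, beta)
print(abs(kl(nu, mu) - S))                           # KL(nu, mu) vs. S_f(beta)
print(duality_gap(nu, mu, [theta * x for x in G]))   # gap for g = theta * G
print(duality_gap(nu, mu, [beta * x for x in f]))    # gap for g = beta * f
```

The last line uses $g = \beta f$, the case in which the duality inequality holds with equality for the canonical ensemble, so the gap is zero up to rounding.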

Theorem 20. Suppose that there is $g \in L_\infty[\mu]$ and $a \ge 1$ such that $0 \le f \le g$, $Df \le ag$
and $Dg \le ag$. Then, for $t > 0$,

$$\Pr\{f - Ef > t\} \le \exp\left(\frac{-t^2}{4aE[g] + 3at/2}\right).$$

If, in addition, $f - \inf_k f \le 1$ for all $k$, then

$$\Pr\{Ef - f > t\} \le \exp\left(\,\cdots\,\right).$$

In [17] the theorem is used to show that the concentration of eigenvalues ($f$) of the
Gram matrix of a sample of independent, bounded random vectors in a Hilbert space is
controlled by the size of the largest eigenvalue ($g$).



7. A glossary of notation 



We conclude with a tabular summary of notation. 



$\Omega = \prod_{k=1}^n \Omega_k$    underlying (product-) probability space.
$\mu = \otimes_{k=1}^n \mu_k$    (product-) probability measure on $\Omega$.
$x_k$    random variable distributed as $\mu_k$ in $\Omega_k$.
$f \in L_\infty[\mu]$    fixed function (negative energy) under investigation.
$g \in L_\infty[\mu]$    generic function.
$E[g] = \int_\Omega g\,d\mu$    expectation of $g$ in $\mu$.
$\sigma^2[g] = E[(g - E[g])^2]$    variance of $g$ in $\mu$.
$\beta = 1/T$    inverse temperature.
$E_{\beta f}[g] = E[g e^{\beta f}]/E[e^{\beta f}]$    thermal expectation of $g$.
$Z_{\beta f} = E[e^{\beta f}]$    partition function.
$S_f(\beta) = \beta E_{\beta f}[f] - \ln Z_{\beta f}$    canonical entropy.
$A_f(\beta) = \beta^{-1} \ln Z_{\beta f}$    Helmholtz free energy.
$\sigma^2_{\beta f}[g] = E_{\beta f}[(g - E_{\beta f}[g])^2]$    thermal variance of $g$.
$x_{y,k}$    vector $x \in \Omega$ with $x_k$ replaced by $y \in \Omega_k$.
$E_k[g](x) = \int_{\Omega_k} g(x_{y,k})\,d\mu_k(y)$    conditional expectation.
$\mathcal{A}_k \subset L_\infty[\mu]$    functions independent of the $k$th variable.
$Z_{k,\beta f} = E_k[e^{\beta f}]$    conditional partition function.
$E_{k,\beta f}[g] = Z_{k,\beta f}^{-1} E_k[g e^{\beta f}]$    conditional thermal expectation.
$S_{k,f}(\beta) = \beta E_{k,\beta f}[f] - \ln Z_{k,\beta f}$    conditional entropy.
$\sigma^2_{k,\beta f}[g] = E_{k,\beta f}[(g - E_{k,\beta f}[g])^2]$    conditional thermal variance.
$\sigma^2_k[g] = E_k[(g - E_k[g])^2]$    conditional variance.
$(\sup_k g)(x) = \sup_{y \in \Omega_k} g(x_{y,k})$    conditional supremum.
$(\inf_k g)(x) = \inf_{y \in \Omega_k} g(x_{y,k})$    conditional infimum.
$\mathrm{ran}_k(g) = \sup_k g - \inf_k g$    conditional range.
$R^2(g) = \sum_k \mathrm{ran}_k^2(g)$    sum of conditional square ranges.
$\Sigma^2(g) = \sum_k \sigma^2_k[g]$    sum of conditional variances.
$Dg = \sum_k (g - \inf_k g)^2$    worst case variance proxy.
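
To make the last few entries concrete, the following sketch evaluates $R^2(g)$, $\Sigma^2(g)$ and $Dg$ pointwise on a small finite product space. Everything specific in it is an assumption made for illustration: the space $\{0,1\}^3$ with uniform coordinate measures, the choice $g(x) = \sum_k x_k$, and the function names.

```python
from itertools import product


def replace(x, k, y):
    """x_{y,k}: the tuple x with coordinate k replaced by y."""
    return x[:k] + (y,) + x[k + 1:]


def cond_values(g, x, k, omega_k):
    """Values of g as coordinate k ranges over omega_k, other coordinates fixed."""
    return [g(replace(x, k, y)) for y in omega_k]


def R2(g, x, spaces):
    """Sum of conditional square ranges at the point x."""
    total = 0.0
    for k, omega_k in enumerate(spaces):
        vals = cond_values(g, x, k, omega_k)
        total += (max(vals) - min(vals)) ** 2
    return total


def Sigma2(g, x, spaces, measures):
    """Sum of conditional variances at the point x."""
    total = 0.0
    for k, (omega_k, mu_k) in enumerate(zip(spaces, measures)):
        vals = cond_values(g, x, k, omega_k)
        mean = sum(p * v for p, v in zip(mu_k, vals))
        total += sum(p * (v - mean) ** 2 for p, v in zip(mu_k, vals))
    return total


def D(g, x, spaces):
    """Worst case variance proxy: sum over k of (g - inf_k g)^2 at the point x."""
    gx = g(x)
    return sum(
        (gx - min(cond_values(g, x, k, omega_k))) ** 2
        for k, omega_k in enumerate(spaces)
    )


# Hypothetical example: three fair coin flips, g counts the ones.
n = 3
spaces = [(0, 1)] * n
measures = [(0.5, 0.5)] * n
g = sum

for x in product(*spaces):
    print(x, R2(g, x, spaces), Sigma2(g, x, spaces, measures), D(g, x, spaces))
```

In this example $Dg = g$ at every point, so the self-boundedness condition $Df \le af + b$ of Theorem 19 holds with $a = 1$ and $b = 0$, while $R^2(g)$ and $\Sigma^2(g)$ are constant.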